PMEM cache RDMA security

ABSTRACT

Techniques are described for providing one or more clients with direct access to cached data blocks within a persistent memory cache on a storage server. In an embodiment, a storage server maintains a persistent memory cache comprising a plurality of cache lines, each of which represents an allocation unit of block-based storage. The storage server maintains an RDMA table that includes a plurality of table entries, each of which maps a respective client to one or more cache lines and a remote access key. An RDMA access request to access a particular cache line is received from a storage server client. The storage server identifies access credentials for the client and determines whether the client has permission to perform the RDMA access on the particular cache line. Upon determining that the client has permission, the cache line is accessed from the persistent memory cache and sent to the storage server client.

FIELD OF THE INVENTION

The present invention relates to managing access permissions for data, within a cache, that belongs to multiple separate databases.

BACKGROUND

Computing elements, such as workstations and server blades, may request data blocks from other "source" computing elements over a network. The source computing elements may represent storage servers that provide remote computing elements with shared storage access. The source computing elements may use a persistent cache (e.g. cache in persistent memory) to cache copies of the data blocks that are primarily stored in primary persistent storage (e.g. disk storage).

Persistent caches are generally faster and smaller than primary storage. If a copy of a data block is stored in the persistent cache when a request for that data block is received, the data block can be returned far more quickly from the persistent cache than from primary storage.

In order to manage input/output (I/O) operations from remote computing elements, storage servers may use a message-based storage protocol. In some cases, the message-based storage protocols may use Remote Direct Memory Access (RDMA) to transfer data between computing elements. RDMA is a technology that allows the network interface controller (NIC) of the storage server to transfer data "directly" to or from memory of a remote computing element, that is, transferring the data to or from the memory without involving the central processing unit(s) (CPUs) on the remote computing element.

For example, a remote computing element issues a read request to the source computing element, the storage server. In response to the read request, the storage server uses a disk controller to perform a block-level read from disk and loads the data into its local memory. The storage server then performs an RDMA write to place the data directly into an application memory buffer of the remote computing element. Using RDMA increases data throughput, decreases the latency of data transfers, and reduces the load on the CPUs of the storage server and the remote computing element during data transfers. Implementations of RDMA protocols may use a mapping table, such as a hash table, to map data block on-disk locations to their respective locations in the persistent cache. The storage server may use the mapping table to locate requested data blocks in the persistent cache using their on-disk location, provided by the remote computing element. However, security for the mapping table is implemented at a table level. That is, if the remote computing element is authorized to access the mapping table, then the remote computing element may access all records in the mapping table, regardless of whether the records are associated with another database within the same cluster as the remote computing element.

One approach to overcome this security issue is to implement security measures for each data block loaded into the persistent cache. By doing so, a remote computing element may only have access to data blocks that the remote computing element is authorized to access. However, the processing overhead of registering, deregistering, and managing security for each data block in the persistent cache is quite large. The CPU on the storage server would be so adversely impacted by the additional overhead from memory registration/deregistration and message processing that the overall performance of the storage server may suffer.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings:

FIG. 1 is a block diagram that illustrates a multi-node DBMS, according to an embodiment of the present invention.

FIG. 2A illustrates the structure of an RDMA table, according to an embodiment of the present invention.

FIG. 2B illustrates a logical hierarchy between cache groups, memory regions, and cache lines within a persistent memory (PMEM) cache, according to an embodiment of the present invention.

FIG. 3 is a flow chart depicting operations performed upon receiving a non-RDMA access request from a client, according to an embodiment of the present invention.

FIG. 4 is a flow chart depicting operations performed upon receiving an RDMA access request from a client, according to an embodiment of the present invention.

FIG. 5 is a block diagram depicting a computer system upon which an embodiment may be implemented.

FIG. 6 is a diagram of a software system that may be employed for controlling the operation of a computer system according to an embodiment of the present invention.

DETAILED DESCRIPTION

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

General Overview

Described herein are novel techniques for accessing data blocks over a network that are cached in a persistent cache. The techniques make use of a form of persistent byte addressable memory referred to as persistent memory (PMEM). Data blocks are cached in a pool of buffers allocated from PMEM, a pool of buffers allocated from PMEM being referred to herein as a PMEM cache. Data blocks may be accessed from the PMEM cache by a remote computing element over a network using remote direct memory access (RDMA). Transmitting data blocks via RDMA avoids CPU overhead attendant to performing read staging in order to transmit data blocks over the network. These techniques are herein referred to as PMEM caching.

Under PMEM caching, in order for a remote computing element, such as a storage server client, to access one or more data blocks in the PMEM cache on a source computing element, such as a storage server, the storage server client needs the memory address of the one or more data blocks within the PMEM cache. An RDMA table may be stored within the PMEM on the storage server. The RDMA table identifies the data blocks in the PMEM cache and specifies a location of the cached data blocks within the PMEM cache. The RDMA table is accessible by remote computing elements using RDMA. For example, the remote computing element can determine the existence and location of a data block within the PMEM cache through RDMA reads of the RDMA table.

In an embodiment, the RDMA table may contain a plurality of records, each record corresponding to one or more cache lines within the PMEM cache. A cache line represents a unit of data that contains one or more cached data blocks. Each record in the RDMA table represents a cache line in the PMEM cache. Each record contains a mapping that maps remote access keys to cache addresses for cached data blocks in the PMEM cache and to their corresponding persistent storage addresses within the storage server. The storage server may allow access to cached data blocks upon receiving an RDMA access request from a storage server client. An RDMA access request, generated by the storage server client, may include a remote access key, which is associated with the target cache line to be accessed from the PMEM cache, and access credentials that are used by the storage server to determine whether the storage server client has permission to access the target cache line from the PMEM cache. In an embodiment, the remote key and the access credentials are used to determine whether the storage server client has valid permissions to access the target cache line. Both the access credentials and the remote key are needed to access data in the PMEM cache. The access credentials are used by the storage server to determine whether the storage server client has permission to access a set of cache lines allocated to a particular client in the PMEM cache. The remote key is used by the storage server to locate a particular record in the RDMA table in order to identify the cache location of the target cache line in the PMEM cache.

In order for the storage server client to generate an RDMA access request, the storage server and the storage server client may establish a secure RDMA connection for communicating RDMA access requests and responses. Upon establishing the RDMA connection, the storage server sends RDMA connection information to the storage server client, which includes a copy of a portion of the RDMA table that contains records corresponding to cache lines currently assigned to the storage server client. Other portions of the RDMA table containing records that reference other cache lines assigned to other clients are not sent to this particular storage server client.

In an embodiment, the storage server may determine how to allocate cache lines to various storage server clients based upon prior data requests from storage server clients. For example, the storage server may determine that particular sets of data are being requested by the storage server client using non-RDMA access requests, and may then store the requested sets of data within the PMEM cache for subsequent requests using RDMA. Upon loading data for the storage server client into the PMEM cache, the storage server may send the copy of the portion of the RDMA table, which contains the records of cache lines assigned to the storage server client, to the storage server client. The copy of the portion of the RDMA table may be used by the storage server client to determine which data is stored in the PMEM cache and when to generate an RDMA access request.

When determining whether to generate an RDMA access request, the storage server client may analyze the local copy of the portion of the RDMA table to determine whether data to be requested, herein referred to as target data, is stored within the PMEM cache. For instance, the storage server client may search for the target data, in the portion of the RDMA table, using the persistent storage address of the target data. If the target data is found within the local copy of the portion of the RDMA table, then the storage server client may identify the remote key that is mapped to the persistent storage address of the target data and generate an RDMA access request for the target data that includes the mapped remote key. The RDMA access request also includes the access credentials for the storage server client. In an embodiment, the access credentials for the storage server client may be derived from the identifiers of the database and database cluster from which the RDMA access request originated.

The storage server may receive the RDMA access request from the storage server client and determine whether the storage server client has permission to access the target data from the PMEM cache. The storage server determines whether the access credentials provided by the requesting storage server client match the access credentials associated with the cache line to be accessed. If the access credentials match, then the storage server determines whether the remote key from the RDMA access request identifies a particular cache line. If the remote key matches the particular cache line, then permission to access the cached data blocks in the PMEM cache is granted. If, however, the access credentials provided by the requesting storage server client do not match the access credentials associated with the cache line to be accessed, or the remote key provided in the RDMA access request does not correspond to the particular cache line, then the storage server client is not granted access to the cached data blocks.
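
For purposes of illustration only, the following Python sketch summarizes the two-factor check described above: the remote key must identify a known cache line, and the client's access credentials must match the protection domain recorded for that cache line. The record layout, key values, and function name are hypothetical and do not correspond to any particular embodiment.

```python
# Illustrative only: cache-line records keyed by remote key, each tagged with the
# protection domain (access credentials) of the client the cache line is allocated to.
rdma_table = {
    0x1A2B: {"protection_domain": "C1.DB1", "cache_location": 0x0000},
    0x3C4D: {"protection_domain": "C1.DB2", "cache_location": 0x4000},
}

def may_access(remote_key, client_credentials):
    """Grant access only if the remote key names a known cache line AND the
    client's credentials match the protection domain of that cache line."""
    record = rdma_table.get(remote_key)
    if record is None:
        return False                       # remote key does not identify a cache line
    return record["protection_domain"] == client_credentials

assert may_access(0x1A2B, "C1.DB1") is True
assert may_access(0x1A2B, "C1.DB2") is False   # credentials do not match the cache line
assert may_access(0x9999, "C1.DB1") is False   # unknown or stale remote key
```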

Illustrative DBMS

PMEM caching is illustrated in the context of a DBMS. A DBMS comprises at least one database server. The database server is hosted on at least one computing element and stores database data in block mode storage devices. The block mode storage devices may be one or more disk drives and flash drives connected via a high speed bus of the computing element to the one or more hardware processors ("processors") of the computing element and/or memory of the computing element. A block mode storage device may also be a network enabled storage device that is connected via a network to the computing element and that comprises other block mode storage devices such as disk drives and flash drives.

More powerful DBMSs are hosted on a parallel processor hardware platform. Such DBMSs are referred to herein as multi-node DBMSs. A multi-node DBMS comprises multiple computing elements referred to herein as computing nodes. Each computing node comprises a hardware processor or multiple hardware processors that each share access to the same main memory.

FIG. 1 is a block diagram that illustrates a multi-node DBMS, in an embodiment. The multi-node DBMS 100 comprises a database cluster with multiple computing nodes, each hosting one or more database server instances, and a storage cell implemented to provide remote storage for the one or more databases within the multi-node DBMS 100. Each database server instance for a particular database provides access to one or more databases stored on storage cell 120. The database server instances of DBMS 100 comprise database server instances 103-1a, 103-1b, and 103-2a. Database server instances 103-1a and 103-2a are hosted on computing nodes 102-1 and 102-2, respectively, and manage access to a first database, referred to as DB1. Database server instance 103-1b, hosted on computing node 102-1, manages access to a second database, referred to as DB2. Databases DB1 and DB2 are both within the same cluster, database cluster 1. Each of the database server instances 103-1a, 103-2a, and 103-1b is connected to the storage cell 120 by a high speed network 101. Database processes running within database server instances 103-1a, 103-2a, and 103-1b are storage device clients of storage cell 120. According to an embodiment, DB1 and DB2 are pluggable databases (PDBs) within a container database managed by a container DBMS. The container DBMS includes database server instances 103-1a, 103-2a, and 103-1b. Database server instances may host database sessions for any of the PDBs within the container. In addition, database service processes, such as log writers, may be storage device clients. Container DBMSs are described in "Isolated Hierarchical Runtime Environments for Multi-Tenant Databases", U.S. application Ser. No. 16/165,996, filed by Santosh Shilimkar, et al., on Oct. 19, 2018, the entire contents of which are incorporated herein by reference.

Storage cell 120 represents a source computing element, such as a storage server, that includes persistent storage (e.g. disk and PMEM) for storing "database files" of the one or more databases in the DBMS 100. Storage cell 120 includes persistent storage 129 and main memory 127. Persistent storage 129 may comprise persistent storage devices such as disk devices or flash memory devices. Main memory 127 represents volatile RAM. Storage process 125 represents a process that receives requests from any of the database server instances 103-1a, 103-1b, and 103-2a to read or write data blocks to or from database files stored in persistent storage 129. Storage process 125 is further discussed in the STORAGE SERVICES section herein. Volatile buffer pool 128 is a buffer pool allocated from main memory 127. Volatile buffer pool 128 comprises buffers used for temporarily staging and/or caching data blocks stored in persistent storage 129.

Storage cell 120 also includes non-volatile RAM memory, PMEM 123. The PMEM 123 includes PMEM cache 121 and RDMA table 122. The PMEM cache 121 is a cache allocated from PMEM 123 and comprises buffers that are used for temporarily staging and/or caching data from persistent storage 129. Once data is added to the PMEM cache 121, it can be used to satisfy subsequent read requests for that data. Eventually, cached data must be removed from the PMEM cache 121 to make room for other data. To select and remove data to make room for other data to be cached, various cache management policies and techniques may be used, such as Least Recently Used (LRU) algorithms or any other data caching policy. Cache manager 126 is a process responsible for performing cache management of the PMEM cache 121.

Database Server Instances

Each of the database server instances of DBMS 100 comprises database processes that run on the computing node that hosts the database server instance. A database process may be, without limitation, a process running within a database session that executes database commands issued within the database session, or a query execution process belonging to a pool of processes that is assigned to execute queries issued through database sessions.

Referring to FIG. 1, each of database server instances 103-1a, 103-1b, and 103-2a comprises database processes and database buffers that cache data blocks read from storage cell 120. Database server instances 103-1a, 103-1b, and 103-2a are hosted on computing nodes 102-1 and 102-2. Database server instance 103-1a comprises DB processes 105-1a and database buffer pool 108-1, which is allocated from main memory 104-1. Database server instance 103-1b comprises DB processes 105-1b and database buffer pool 108-1. Database server instance 103-2a, hosted on computing node 102-2, comprises DB process 105-2a and database buffer pool 108-2, which is allocated from main memory 104-2. Each of the database server instances 103-1a, 103-1b, and 103-2a may include additional DB processes (not shown).

RDMA

Network adapters 109-1, 109-2, and 124 connect computing nodes 102-1 and 102-2 and storage cell 120, respectively, to network 101. Network adapters 109-1, 109-2, and 124 may comprise any type of network adapter that supports RDMA operations, herein referred to as RDMA-enabled network interface controllers (RNICs). Example RNICs include, without limitation, InfiniBand host channel adapters (HCAs). Network adapters 109-1, 109-2, and 124 implement the RDMA protocol (RDMAP) above a reliable transport. Thus, network adapters 109-1, 109-2, and 124 may support protocols including, without limitation, RDMA over Converged Ethernet (RoCE), the Internet Wide Area RDMA Protocol (iWARP), and the InfiniBand protocol.

Network adapters 109-1, 109-2, and 124 may be associated with an RNIC interface (RI), where the RI is the presentation of the RNIC to a consumer as implemented through the combination of the RNIC device and an RNIC device driver. An RNIC device driver represents a computer program configured to translate commands between the operating system and the computing language used by the RNIC device. A consumer, as used herein, may be any computer system process, such as an application process or operating system kernel, that communicates with the RNIC through the RI. Examples of computer system processes that communicate with the RNIC using the RI include storage services 112-1 and 112-2 and storage process 125. The RI may be implemented as a combination of hardware, firmware, and software. A kernel may expose some or all of the RI functionality to application processes through one or more Application Programming Interfaces (APIs).

Processing performed by network adapters 109-1, 109-2, and 124 may be performed by an RNIC engine. The RNIC engine offloads the processing of RDMA messages from CPUs on computing nodes 102-1, 102-2, and storage cell 120 onto their respective network adapters 109-1, 109-2, and 124. The implementation of the RNIC engine may vary from implementation to implementation. For example, the RNIC engine may be included in an integrated circuit or some other hardware on a network adapter card. In another embodiment, the network adapters may include special purpose processors with a limited instruction set for processing RDMA messages and performing direct memory accesses.

RNIC engines, implemented by network adapters 109-1, 109-2, and 124, may establish connection queues between each other for the purpose of sending RDMA access requests and receiving result sets of data corresponding to the RDMA access requests. For example, the InfiniBand protocol uses a queue pair comprising a send queue and a receive queue for the purposes of transferring data between client computing nodes and storage cells. The send queue may be used to send data requests from computing node 102-1 to storage cell 120, while the receive queue may be used, by the computing node 102-1, to receive the requested data from the storage cell 120. In an embodiment, the queue pairs, once established, remain open and may be used for subsequent RDMA access requests until the queue pair is terminated by either the computing node 102-1 or the storage cell 120. Queue pairs may be terminated if a request for data is not authorized. For example, if a client on computing node 102-1 requests data from PMEM cache 121, where the requested data is not authorized for access by the client on 102-1, then the storage cell 120 may deny the request by terminating the open queue pair between computing node 102-1 and storage cell 120. By terminating the queue pair, the storage cell 120 may ensure that future unauthorized requests are not received from the client.

RDMA Table

In an embodiment, RDMA table 122 stores records of data cached in the PMEM cache 121. The records in the RDMA table 122 specify the memory address of the cached data in persistent storage 129, the cache memory address of the cached data, and a remote key assigned to the specific clients for the cached data. The RDMA table 122 may be an RDMA hash table. In an embodiment, storage cell 120 stores the RDMA table 122 in PMEM 123. Optionally, the storage cell 120 may store a copy of the RDMA table 122 in the volatile buffer pool 128, which may be used by the storage process 125 for reads and updates. The RDMA table 122 contains references to sets of cached data blocks stored in the PMEM cache 121 and specifies the storage location within the PMEM cache 121 ("cache location") of each cached data block in the PMEM cache 121. A cache location may refer to a memory address of a buffer within the PMEM cache 121 or an offset from a base memory address of the PMEM cache 121. The base memory address of a data structure is the memory address of the beginning of the region of memory at which the data structure is stored.

Data stored in the PMEM cache 121 is accessed and stored in terms of cache lines. A cache line is a unit of data transferred between volatile buffer pool 128 or persistent storage 129 and the PMEM cache 121. In an embodiment, a cache line represents 64 KB of memory. In other embodiments, cache lines may represent larger or smaller sizes. If a client wants to read a target data block within a particular cache line, the client will request to read the entire cache line in order to identify and read the target data block. In an embodiment, an extent represents a memory region, within the PMEM cache 121, allocated to a specific client. An extent represents an allocation unit of one or more contiguous cache lines. For example, extents may represent 2 MB (32 cache lines) of contiguous space within the PMEM cache 121. In an embodiment, the size of extents within the cache is uniform, that is, only one extent size may be configured for the PMEM cache 121. One or more extents may be allocated to a client at a given time. In yet other embodiments, differently sized extents may be configured for the PMEM cache 121.
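
As a concrete illustration of the example sizes mentioned above (64 KB cache lines grouped into 2 MB extents of 32 contiguous cache lines), the following sketch computes which cache line within an extent holds a given byte offset. The constants simply restate the example figures from the text, and the function name is hypothetical.

```python
CACHE_LINE_SIZE = 64 * 1024                          # 64 KB per cache line (example size)
EXTENT_SIZE = 2 * 1024 * 1024                        # 2 MB per extent (example size)
LINES_PER_EXTENT = EXTENT_SIZE // CACHE_LINE_SIZE    # 32 contiguous cache lines per extent

def locate(offset_in_extent):
    """Return (cache line index, offset within that line) for a byte offset into an
    extent; a read is always issued for the whole cache line containing the block."""
    line_index = offset_in_extent // CACHE_LINE_SIZE
    line_offset = offset_in_extent % CACHE_LINE_SIZE
    return line_index, line_offset

assert LINES_PER_EXTENT == 32
assert locate(0) == (0, 0)
assert locate(65 * 1024) == (1, 1024)   # second cache line of the extent, 1 KB in
```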

A protection domain is used to identify a group of one or more extents that have similar security requirements for a first client, where the first client may represent one or more instances of a DB or multiple PDBs within a cluster. For instance, the protection domain may be represented by a first protection domain identifier. The first protection domain identifier may be assigned to the first client and may be used by the first client to access cache lines, within the one or more extents, that the first client has privileges to access. Other cache lines, within the PMEM cache 121, that are not part of the one or more extents assigned to the first protection domain identifier of the first client are not accessible, nor viewable, by the first client. That is, the first client, using the assigned first protection domain identifier, is only aware of the cache lines associated with the first protection domain identifier. A second client, assigned a second protection domain identifier, may only view and access cache lines that are associated with the second protection domain identifier. Cache lines associated with either the first protection domain identifier or another protection domain identifier are not accessible by the second client.

FIG. 2A illustrates the structure of the RDMA table 122, according to an embodiment of the present invention. Referring to FIG. 2A, RDMA table 122 includes records 202-1, 202-2, 202-3 through 202-N that reference cache lines stored in the PMEM cache 121. According to an embodiment, each cache line comprises an array of elements that are stored contiguously within a memory address space of PMEM cache 121. Each of the records in the RDMA table 122 contains an array of elements referencing the elements in the corresponding cache lines. For example, record 202-1 comprises elements 204-1, 204-2, 204-3 through 204-N. Each element corresponds to a cached data block cached in PMEM cache 121 and includes at least the following attributes:

Home Location: Specifies the home location of the corresponding cached data block. The term "home location" refers to the storage location of a data block in persistent storage, and not to the storage location or address within the PMEM cache 121 or other buffer pool that is used to temporarily cache data blocks from the persistent storage.

Cache Location: The storage location of the cached data block in the PMEM cache 121.

Remote Key: An access identifier generated when the storage cell 120 registers a specific memory region, of one or more extents, to a particular protection domain. The remote key is passed to the client associated with the particular protection domain and is used by the client to locate and access the specific memory region associated with the particular protection domain. For example, if a client on computing node 102-1 is assigned the particular protection domain, then storage process 125 would provide the remote key to the client on computing node 102-1, and the client may access the corresponding memory region from the RDMA table 122 using RDMA operations generated and sent to storage cell 120 by the RNIC on network adapter 109-1.

Valid Flag: A flag that indicates whether the information in the element is valid or invalid. When valid, the information is accurate and may be relied upon for a period of time referred to as an expiration period.
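
For illustration only, one such element might be modeled as in the following Python sketch. The field names mirror the attributes listed above, while the expiration handling and all concrete values are simplifying assumptions added for this example and are not part of any embodiment described herein.

```python
from dataclasses import dataclass
import time

@dataclass
class CacheLineElement:
    home_location: int     # storage location of the data block in persistent storage
    cache_location: int    # location of the cached copy within the PMEM cache
    remote_key: int        # access identifier registered for the owning protection domain
    valid: bool            # whether the element's information may currently be relied upon
    valid_until: float     # illustrative expiration timestamp for the valid flag

    def is_usable(self):
        """An element is usable only while its valid flag holds and has not expired."""
        return self.valid and time.time() < self.valid_until

# Example element: block at disk offset 0x7F000 cached at PMEM cache offset 0x2000.
element = CacheLineElement(home_location=0x7F000, cache_location=0x2000,
                           remote_key=0x1A2B, valid=True,
                           valid_until=time.time() + 30.0)
print(element.is_usable())
```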

Records 202-1, 202-2, 202-3 through 202-N in the RDMA table 122 are illustrated with their associated protection domains; however, the protection domains are not part of the physical records within the RDMA table 122. The protection domains shown in RDMA table 122 are used to illustrate which cache lines are associated with which protection domains. For example, protection domain identifier C1.DB1 is associated with the cache lines of records 202-1 and 202-2. Protection domain identifier C1.DB2 is associated with the cache line of record 202-3. Protection domain identifier C1.DB3 is associated with the cache line of record 202-N. If client A is assigned the protection domain identifier C1.DB1, then client A, once an RDMA connection between client A and storage cell 120 is established, may access the RDMA table 122. Client A's access of the RDMA table 122 would only include access to the cache lines represented by records 202-1 and 202-2. The remaining cache lines represented by records 202-3 and 202-N would not be accessible by client A, as client A's protection domain identifier (C1.DB1) is only associated with the cache lines represented by records 202-1 and 202-2. Similarly, if client B is assigned the protection domain identifier C1.DB2, then client B's access to the RDMA table 122 would only allow client B to access the cache line represented by record 202-3.

In an embodiment, storage cell 120 provides copies of portions of the RDMA table 122 to storage server clients so that the storage server clients may look up remote key values that are associated with particular cache lines for the purposes of generating RDMA access requests. Specifically, the storage cell 120 sends a set of records from the RDMA table 122 that correspond to a particular protection domain identifier assigned to a particular storage server client. For example, if client A is assigned protection domain identifier C1.DB1, then the storage cell 120 would send client A a copy of records 202-1 and 202-2, which are associated with the C1.DB1 protection domain identifier.

Storage server clients may use stored local copies of the portions of the RDMA table 122 to generate RDMA access requests and send the generated requests to the storage cell 120 using the RDMA connection queues. For example, upon receiving a copy of the portion of the RDMA table 122, client A, which has an instance of DB1 of cluster 1 running, may store the copy of the portion of the RDMA table 122 in main memory 104-1. Client A, when seeking to retrieve data ("target data") stored in the storage cell 120, may reference a data dictionary, stored in main memory 104-1, to determine the home location of the target data. Client A may then scan the local copy of the portion of the RDMA table 122 using the home location of the target data to determine whether the target data is currently loaded in the PMEM cache 121 in storage cell 120. If client A determines that a record in the portion of the RDMA table 122 corresponds to the home location of the target data, then the target data is in the PMEM cache 121 and client A may retrieve the remote key assigned to the target data from the record in the portion of the RDMA table 122. Client A may use the remote key associated with the record of the target data to generate an RDMA access request. If, however, the target data is not in the local copy of the portion of the RDMA table 122, then the target data is not currently loaded in the PMEM cache 121 and client A would have to generate and send a non-RDMA access request to the storage cell 120.
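
The client-side lookup described above can be sketched as follows. For illustration, the local portion of the RDMA table is modeled as a mapping keyed by home location; the record layout, addresses, and function name are hypothetical.

```python
# Local copy of the portion of the RDMA table assigned to this client, keyed by the
# home location (persistent storage address) of each cached data block.
local_rdma_portion = {
    0x7F000: {"remote_key": 0x1A2B, "cache_location": 0x2000},
    0x80000: {"remote_key": 0x1A2C, "cache_location": 0x3000},
}

def build_request(home_location, protection_domain):
    """Return an RDMA access request if the target data is cached, else None,
    meaning the client must fall back to a non-RDMA access request."""
    record = local_rdma_portion.get(home_location)
    if record is None:
        return None
    return {"remote_key": record["remote_key"],
            "protection_domain": protection_domain,
            "home_location": home_location}

print(build_request(0x7F000, "C1.DB1"))   # RDMA access request for cached target data
print(build_request(0x90000, "C1.DB1"))   # None: issue a non-RDMA access request instead
```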

Cache Groups

A cache group represents a logical grouping of memory regions of cache lines in the PMEM cache 121 that have been assigned to a particular client. FIG. 2B illustrates a logical hierarchy between cache groups, memory regions, and cache lines within the PMEM cache 121. In an embodiment, PMEM cache 121 may contain cache lines assigned to multiple different cache groups. A cache group may represent one or more specific clients, where a client may be defined as one or more database processes accessing a particular database in a specific database cluster. For example, computing node 102-1 may include multiple cache groups representing multiple clients, since computing node 102-1 hosts database cluster 1, which includes DB1 and DB2. Cache group 250-1 represents a first cache group that includes database processes running within database sessions for DB1 from database cluster 1, and cache group 250-2 represents a second cache group that includes database processes running within database sessions for DB2 from database cluster 1.

Cache groups may be associated with an identifier, referred to as a protection domain. In an embodiment, protection domains are values derived from a database cluster identifier and a database identifier. For example, the protection domain for cache group 250-1 is C1.DB1, where C1 represents the cluster identifier and DB1 represents the database identifier for database 1 in cluster 1. The protection domain for cache group 250-2 is C1.DB2, where C1 represents the cluster identifier and DB2 represents the database identifier for database 2 in cluster 1.
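
For illustration, the derivation of a protection domain identifier from the cluster and database identifiers might look like the following sketch. The plain form mirrors the C1.DB1 example above; the hashed variant reflects the alternative numeric representations mentioned later in this description, and the function names are hypothetical.

```python
import hashlib

def protection_domain(cluster_id, database_id):
    """Derive the plain protection domain identifier, e.g. 'C1.DB1'."""
    return f"{cluster_id}.{database_id}"

def protection_domain_hash(cluster_id, database_id):
    """Alternative numeric form: a short hash computed from the same identifiers."""
    digest = hashlib.sha256(f"{cluster_id}.{database_id}".encode()).hexdigest()
    return int(digest[:8], 16)

print(protection_domain("C1", "DB1"))        # -> C1.DB1
print(protection_domain("C1", "DB2"))        # -> C1.DB2
print(protection_domain_hash("C1", "DB1"))   # -> stable numeric identifier
```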

In an embodiment, each cache group may be assigned one or more memory regions. For example, memory regions 260-1, 260-2, and 260-N represent memory regions assigned to cache group 250-1. Each memory region contains a set of contiguous cache lines. For example, the cache lines corresponding to records 202-1 and 202-2 represent contiguous cache lines within memory region 260-1.

In an embodiment, cache lines within memory regions assigned to a particular cache group are only accessible and visible to the particular client represented by the particular cache group. For example, records 202-1 and 202-2 belong to cache group 250-1 and record 202-3 is assigned to cache group 250-2. A client represented by cache group 250-1 can only view records in RDMA table 122 assigned to cache group 250-1, which includes records 202-1 and 202-2. A client represented by cache group 250-2 can only view records in RDMA table 122 assigned to cache group 250-2, which includes record 202-3. Referring to FIG. 2B, cache group 250-2 contains the assigned memory region 261-1, which contains the cache line corresponding to record 202-3.

In some embodiments, multiple databases within a cluster may be assigned to a single cache group. This may occur when particular databases within a cluster do not have specific cache management policies and have the same level of security between them. For example, multiple databases within the cluster may share database information, such as database tables, such that security between the multiple databases does not require exclusive access to data in cache lines associated with each of the multiple databases within the cluster. As a result, a cache group that represents multiple databases within a cluster may be assigned a single protection domain to represent the multiple databases within the cluster. For example, cache group 250-N may represent a group of databases that share cache lines within the PMEM cache 121. The cache group 250-N may be identified with protection domain C1.DB0, which represents the multiple databases within cluster 1 that have the same access privileges to specific data.

In an embodiment, storage process 125, on storage cell 120, is implemented to allocate memory regions in the PMEM cache 121 to clients by assigning a protection domain to a requesting client and allocating one or more memory regions to the requesting client. A protection domain has a one-to-one relationship to a cache group, such that each cache group is mapped to a single protection domain identifier. For example, storage process 125 may receive a request, from computing node 102-1, for data stored in the storage cell 120. Storage process 125 may determine whether the requested data should be loaded into the PMEM cache 121. By loading the requested data into the PMEM cache 121, subsequent data requests from the client may be made using an RDMA access request, thereby bypassing involvement of the CPU on the storage cell 120. Upon determining that the requested data may be loaded into the PMEM cache 121, the storage process 125 may register the client by assigning a protection domain identifier to the client. The protection domain identifier may be based on the requesting client's cluster ID and database ID. The generated protection domain identifier is then stored within a cache group data structure, which may be a table or other object, configured to keep track of clients, cache groups, and their corresponding protection domain identifiers. The storage process 125 allocates a memory region for the requesting client by communicating with the cache manager 126 to determine an available set of contiguous cache lines in the PMEM cache 121 that may be assigned to the requesting client, where the available set of contiguous cache lines would make up a new memory region. Once the new memory region has been identified, the storage process 125 may update the RDMA table 122 to reflect the allocation of the memory region to the requesting client. The storage process 125 may then load the requested data for the requesting client into one or more of the newly allocated cache lines of the new memory region in the PMEM cache 121.
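
For purposes of illustration, the registration and allocation steps described above might be summarized as in the following simplified sketch. It assumes the cache group data structure is a dictionary, that a remote key is generated per registered memory region, and that the requesting client only receives the portion of the table assigned to it; all function and field names are hypothetical.

```python
import secrets

cache_groups = {}   # protection domain identifier -> remote keys of allocated memory regions
rdma_table = {}     # remote key -> record describing a registered memory region

def register_client(cluster_id, database_id, free_region_lines):
    """Assign a protection domain to the client, register a memory region of contiguous
    cache lines to it, and return only the portion of the RDMA table (remote key plus
    cache-line records) that this client is permitted to see."""
    pd = f"{cluster_id}.{database_id}"           # protection domain identifier for the client
    remote_key = secrets.randbits(32)            # access identifier for the registered region
    record = {"protection_domain": pd, "cache_lines": free_region_lines}
    rdma_table[remote_key] = record              # reflect the allocation in the RDMA table
    cache_groups.setdefault(pd, []).append(remote_key)
    return {"protection_domain": pd, "portion": {remote_key: record}}

info = register_client("C1", "DB1", free_region_lines=list(range(0, 32)))
print(info["protection_domain"], list(info["portion"].keys()))
```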

The storage process 125 may then send the requesting client the requested data, RDMA connection information that includes the address for the RDMA table 122, and a copy of a portion of the RDMA table 122 that includes the records assigned to the requesting client. The records within the portion of the RDMA table 122 include remote keys for remotely accessing the RDMA table 122 and any other information needed by the client to generate and send subsequent data requests using RDMA. For example, if storage process 125 allocated a memory region containing the cache lines represented by records 202-1 and 202-2 to the client represented by cache group 250-1, then storage process 125 would send information containing the address for the RDMA table 122 and a copy of a portion of the RDMA table 122, containing records 202-1 and 202-2, to the client. Once an RDMA connection between the client and storage cell 120 is established, the client may access the cache lines in the PMEM cache 121 by generating RDMA access requests that include the remote keys corresponding to the records in the copy of the portion of the RDMA table 122 provided to the client. The client's access to cache lines in the PMEM cache would be limited to the cache lines corresponding to records 202-1 and 202-2, as these are the only records assigned to the client based on the associated protection domain identifier. RDMA access requests generated by the client may also include the associated protection domain identifier. The associated protection domain identifier is not included in the copy of the portion of the RDMA table 122 that is stored on the client but is instead derived by the client based upon the client's DB identifier and database cluster identifier. For example, if the client represents database DB1 within cluster C1, then the client, when generating the RDMA access request, may generate the protection domain identifier as "C1.DB1".

Cache Management Policy

Each cache group has its own cache management policy that specifies how data within its allocated memory regions is managed. For example, a cache group policy may specify a maximum total cache size of 10 gigabytes for the memory regions assigned to a particular client. If a request is received, from the particular client, to load additional data into the PMEM cache 121 that would exceed the maximum size cap, then the cache manager 126 may reuse existing cache lines to load the requested data. Reuse of cache lines refers to invalidating existing data within one or more cache lines, for the particular client, and overwriting the existing data with the requested data.

In an embodiment, the cache manager 126 is implemented to manage and enforce cache management policies for the cache groups. The cache manager 126 determines when to remove data from the PMEM cache 121 in order to load newly requested data, based upon the cache management policies associated with each cache group. In an embodiment, each cache group has its own cache group policy that defines rules for cache line management. Cache group policies may define when data in cache lines is to be reused so that new data may be loaded into the cache lines. In an embodiment, the cache management policies may specify whether a particular cache group has a minimum size requirement, a maximum size requirement, or a variable size requirement. A minimum size requirement specifies a minimum size of allocated memory regions that are assigned to a cache group at any particular time. For example, cache group 250-1 may have a minimum size requirement of 5 gigabytes, which means that at any given time, cache group 250-1 should have memory regions within PMEM cache 121 that total at least 5 GB in size. This may be necessary if the client for cache group 250-1 has certain service-level agreements (SLAs) that require a minimum amount of PMEM cache 121 storage in order to meet response demand.
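
A minimal sketch of a per-cache-group policy enforcing a maximum size cap with LRU reuse might look as follows. The 10-gigabyte default mirrors the example above, the LRU selection is simplified to a single ordered structure, and none of the names correspond to an actual implementation.

```python
from collections import OrderedDict

MAX_GROUP_BYTES = 10 * 1024**3        # example maximum total cache size for a cache group
CACHE_LINE_SIZE = 64 * 1024           # example cache line size from the text

class CacheGroupPolicy:
    def __init__(self, max_bytes=MAX_GROUP_BYTES):
        self.max_bytes = max_bytes
        self.lines = OrderedDict()    # cache line id -> data, ordered by recency of use

    def load(self, line_id, data):
        """Load data into a cache line, reusing the least recently used line when the
        cache group's size cap would otherwise be exceeded."""
        if line_id in self.lines:
            self.lines.move_to_end(line_id)
        elif (len(self.lines) + 1) * CACHE_LINE_SIZE > self.max_bytes:
            self.lines.popitem(last=False)   # invalidate and reuse the LRU cache line
        self.lines[line_id] = data

policy = CacheGroupPolicy(max_bytes=3 * CACHE_LINE_SIZE)
for i in range(5):
    policy.load(i, b"...")
print(list(policy.lines))                    # only the three most recently used lines remain
```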

Storage Services

To initiate a data block read operation for a data block from a block enabled storage device, a database process running within a database server instance needs to determine the home location of the data block within the persistent storage 129 of the storage cell 120, such as the block address within a flash memory or a disk offset on a particular disk. To make this determination, a DBMS maintains mapping data within a data dictionary that specifies which database files hold data blocks for which database tables, and uses a storage service that maps database files and offsets within the database files to home locations in persistent storage. Each database server instance of DBMS 100 may store a copy of the mapping data within volatile RAM for quick access.

In an embodiment, each computing node of DBMS 100 hosts a storage service. Referring to FIG. 1, computing node 102-1 hosts storage service 112-1. Storage service 112-1 comprises one or more storage processes and storage layer 106-1. Similarly, computing node 102-2 hosts storage service 112-2, which comprises one or more storage processes and storage layer 106-2. A storage layer includes software and associated storage metadata that describes how database files are stored on various storage devices, such as disks and other persistent memory. The storage layer software is executed by storage processes within storage services 112-1 and 112-2.

In order for a process on computing node 102-1 and/or 102-2 to send RDMA access requests, the process needs information about the RDMA table 122, such as the storage location of the table, as well as RDMA connection queue information, such as queue pair information describing active queues between the computing node and the storage cell 120. Such information, once obtained, is stored in a dedicated area of main memory 104-1 and 104-2. For example, table configuration data may be stored in main memory 104-1 of computing node 102-1. Table configuration data contains information about RDMA table 122. Among the information contained in table configuration data is the base memory address of RDMA table 122, the memory size of the cache lines, as well as a local copy of a portion of the RDMA table 122 containing records of cache lines assigned to a client. For example, if the client represents DB1 in cluster 1 (protection domain C1.DB1), then the local copy of the portion of the RDMA table 122 would only include records corresponding to cache lines assigned to DB1 in cluster 1.

The table configuration data may be provided, by the storage cell 120, to computing nodes 102-1 and 102-2 upon establishing an RDMA connection with storage cell 120. In an embodiment, storage process 125 may receive non-RDMA access requests and, in response, may establish an RDMA connection with computing node 102-1 for receiving subsequent data access requests via RDMA. For example, storage process 125, upon processing a data access request, may determine that the requested data should be loaded into the PMEM cache 121. The storage process 125 may load the requested data into the PMEM cache 121, establish RDMA connection queues between the storage cell 120 and the computing node 102-1, and then provide the requested data to the computing node 102-1 along with the table configuration data for the RDMA table 122 and the RDMA connection information. Additional detail for processing non-RDMA requests is described in the NON-RDMA ACCESS REQUESTS section herein.

If the table configuration data for the RDMA table 122 already exists in main memory 104-1 and an active RDMA connection exists for computing node 102-1, then the storage service 112-1 may generate an RDMA access request, provided that the requested data is in the PMEM cache 121. As described, the table configuration data includes a local copy of a portion of the RDMA table 122 corresponding to the cache group to which the client belongs. For instance, if the client is associated with cache group 250-1 (protection domain identifier C1.DB1), then the local copy of the RDMA table 122 would contain records for all cache lines assigned to C1.DB1, which includes records 202-1 and 202-2. In an embodiment, the storage service 112-1 may determine whether the requested data is in the PMEM cache 121 by determining whether there is a cache hit when scanning the local copy of the RDMA table 122. If there is a cache hit, then the storage service 112-1 may generate an RDMA access request and send the RDMA access request to the storage cell 120 using the RDMA connection queues. If, however, there is not a cache hit on the local copy of the RDMA table 122, then the storage service 112-1 may generate a non-RDMA access request and send the non-RDMA access request to the storage process 125 on the storage cell 120.
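
The decision described above reduces to two tests, sketched below under the assumption that the table configuration data and the local table copy are modeled as simple in-memory values; all names are hypothetical.

```python
def choose_request_type(has_active_rdma_connection, local_table_portion, home_location):
    """Issue an RDMA access request only when an RDMA connection is active AND the
    local copy of the RDMA table shows a cache hit for the target data; otherwise
    fall back to a non-RDMA access request to the storage process."""
    if has_active_rdma_connection and home_location in local_table_portion:
        return "rdma"
    return "non-rdma"

local_portion = {0x7F000: {"remote_key": 0x1A2B}}
print(choose_request_type(True, local_portion, 0x7F000))   # -> rdma
print(choose_request_type(True, local_portion, 0x90000))   # -> non-rdma (cache miss)
print(choose_request_type(False, local_portion, 0x7F000))  # -> non-rdma (no connection)
```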

For example, DB process 105-1a, representing cluster 1 DB1, may request data from storage cell 120 by sending a request to storage service 112-1. Storage service 112-1 may receive the request from DB process 105-1a and may convert the request into either an RDMA access request or a non-RDMA access request, depending upon whether an RDMA connection has been established and whether the requested data is already stored in the PMEM cache 121.

Non-RDMA Access Requests

In an embodiment, a non-RDMA request represents a data access request sent via the network 101 and received by the storage process 125 on storage cell 120. Non-RDMA requests may be generated when there is no active RDMA connection between computing node 102-1 and storage cell 120.

FIG. 3 is a flow chart depicting operations performed upon receiving a non-RDMA access request, according to an embodiment. The operations for receiving a non-RDMA access request are illustrated using DB process 105-1a on computing node 102-1 and storage process 125 on storage cell 120. DB process 105-1a may initiate a read or write operation to either read a data block or write to a data block. Storage service 112-1 may receive a request from DB process 105-1a and determine the home location for the data block. Once the home location for the data block is determined, storage service 112-1 may determine whether the requested data may be accessed using an RDMA access request. As described, in order for an RDMA access request to be issued, an active RDMA connection needs to be available between computing node 102-1 and storage cell 120 and the requested data needs to be available in the PMEM cache 121. The storage service 112-1 may check whether an active RDMA connection exists by checking the table configuration information stored in main memory 104-1. If an active RDMA connection exists, then the storage service 112-1 may scan a local copy of the RDMA table 122, stored in the main memory 104-1, to determine whether there is a cache hit for the requested data. If there is a cache hit, then the requested data has been loaded in the PMEM cache 121 and the storage service 112-1 may generate an RDMA access request for the requested data. If, however, there is no cache hit for the requested data in the local copy of the RDMA table 122, then the requested data is not currently stored in the PMEM cache 121, and the storage service 112-1 may generate a non-RDMA access request and send the non-RDMA access request to the storage process 125.

Operations described in FIG. 3 are described in the context of a read request; however, the operations may also apply to a request to write data to a data block. At block 302, a request, from a client, to read data from a remote node is received. In an embodiment, the storage process 125 receives a request to read data from a data block stored in persistent storage 129 of the storage cell 120. For example, DB process 105-1a may have initiated a data read request for a data block stored within persistent storage 129. The storage service 112-1 on computing node 102-1 may have received the request and generated a non-RDMA access request for the requested data. The non-RDMA access request may include information identifying the requesting database and process, such as identifying that DB process 105-1a of instance 103-1a is requesting to read data for DB1 in cluster 1. The requested data may be identified using its "home location", corresponding to the storage location of the requested data in persistent storage 129.

At decision diamond 304, it is determined whether the client is associated with existing access credentials of an existing cache group. In an embodiment, storage process 125 determines whether the client is associated with an existing protection domain for an existing cache group by searching the cache group data structure that maintains clients, cache groups, and their associated protection domain identifiers. For example, the cache group data structure may be a table, which the storage process 125 queries using the cluster ID and database ID of the client. If the client is not listed in the cache group data structure table, then the storage process 125 may proceed to block 306 to assign a cache group to the client. If, however, the client is listed in the cache group data structure table, then the client is associated with a cache group that may have assigned cache lines in the PMEM cache 121. In this scenario, the storage process 125 proceeds to decision diamond 308 to determine whether a copy of the data is stored within the PMEM cache 121.

At block 306, a cache group is assigned to the client. In an embodiment, if the client is not listed in the cache group data structure table, then the storage process 125 assigns a cache group to the client based on which cluster and database the client represents. For example, if the request originates from DB process 105-1a, which is a DB process for DB1 in cluster 1, then the storage process 125 may generate a new cache group for the client and assign the client a protection domain identifier that is derived from the cluster and database of the client, such as C1.DB1. The protection domain identifier represents an identifier for the cache group. The storage process 125 may insert a record into the cache group data structure table that includes the new cache group and the cluster and database identifiers for the client. In other embodiments, the protection domain identifier may be derived from the cluster and database identifiers but may be represented as a numeric value, an alphanumeric value, or a hash value computed from the cluster and database identifiers.

In an embodiment, if the client represents a database within a cluster that shares content with other databases within the cluster and does not specify a separate cache management policy, then the storage process 125 may add the client to an existing cache group, where the existing cache group represents clients within the same cluster that share the same level of security between them. For example, if the client represents DB1 in cluster 1 and an existing cache group contains DB2 in cluster 1, where both DB1 and DB2 share common tables, have the same level of security, and implement a default cache management policy, then the storage process 125 may add the client (DB1) to the cache group that contains DB2. The storage process 125 may insert a new record into the cache group data structure table that includes the existing cache group and the cluster and database identifiers for the client, such as cluster 1 and DB1.

Upon assigning a cache group to the client, the storage process 125 registers a memory region of cache lines in the PMEM cache 121 to the new cache group of the client. At block 310, a memory region in the PMEM cache 121 is assigned to the new cache group of the client. In an embodiment, the storage process 125 registers a new memory region to the new cache group by identifying a free memory region in the PMEM cache 121. The storage process 125 may request from the cache manager 126 a free memory region of cache lines to assign to the new cache group. The cache manager 126 may allocate previously unallocated contiguous cache lines to be a new memory region in the PMEM cache 121. Alternatively, if there are no unallocated cache lines in the PMEM cache 121, then the cache manager 126 may identify a set of allocated cache lines that may be reassigned based upon the respective cache management policies of other cache groups. The cache manager 126 may use an LRU algorithm to identify a contiguous set of cache lines currently allocated to a cache group that has a cache management policy that allows for reassignment of cache lines to another cache group.

In an embodiment (shown by the dashed arrow from block 306 to block 312), if the client was assigned to an existing cache group at block 306, the storage process 125 may proceed directly to block 312 to load the requested data into a memory region of cache lines, provided that the existing cache group has available cache lines already allocated. For example, if the existing cache group had a memory region of cache lines that was only 50% utilized, then the storage process 125 may simply load the requested data into the free cache lines without needing to register a new memory region to the existing cache group.

At block 312, the requested data is loaded into cache lines in a memory region of the PMEM cache 121. In an embodiment, the storage process 125 may use the provided home location of the requested data, from the initial request, to find the data in persistent storage 129 and load the data into the PMEM cache 121 using the cache locations specified by the free memory region allocated to the cache group of the client. In an embodiment, upon loading the requested data into cache lines in the PMEM cache 121, the storage process 125 updates the records in the RDMA table 122 to reflect the loading of the requested data into specific cache lines in the PMEM cache 121. Additionally, the storage process 125 may also load the requested data into the volatile buffer pool 128, such that it may be directly delivered to the client.

Referring back to decision diamond 304, if the storage process 125 determines that the client is already associated with a cache group, then the storage process 125 may proceed to decision diamond 308. At decision diamond 308, it is determined whether a copy of the requested data currently exists in the PMEM cache 121. In an embodiment, the storage process 125 queries the RDMA table 122 for the requested data using the home location of the requested data and the protection domain of the client to authenticate access for the client. If a copy of the requested data is in the PMEM cache 121, then the RDMA table 122 would contain a record indicating the cache location of the requested data in the PMEM cache 121. Upon determining that a record in the RDMA table 122 exists for the requested data, the storage process 125 proceeds to block 314 to retrieve a copy of the data from the PMEM cache 121. If, however, the RDMA table 122 does not contain a record indicating that a copy of the requested data exists in the PMEM cache 121, then the storage process 125 proceeds to block 312 to load the requested data into the PMEM cache 121.

At block 314, a copy of the requested data is retrieved from the PMEM cache 121. In an embodiment, the storage process 125, upon determining that a record in the RDMA table 122 exists, identifies the cache location of the requested data from the RDMA table 122 and uses the cache location to retrieve the copy of the data from the PMEM cache 121.

At block 316, the copy of the data is sent to the client. In an embodiment, the storage process 125 generates and sends a message containing the copy of the requested data to the requesting client. The storage service 112-1 on the computing node 102-1 may receive the message, store the copy of the data within the database buffer pool 108-1, and provide the local address of the copy of the requested data to the DB process 105-1a. In an embodiment, the message sent to the computing node 102-1 by the storage process 125 may also contain table configuration information and RDMA connection information, such that subsequent requests for data may be made using an RDMA access request. The table configuration information may include the RDMA table 122 address location and a local copy of the records in the RDMA table assigned to the client.

RDMA Access Request

FIG. 4 is a flow chart depicting operations performed upon receiving an RDMA access request from a client, according to an embodiment. According to at least one embodiment, RDMA access requests may be generated if the client has previously established an RDMA connection between the client and the storage cell 120 and a copy of the requested data is loaded in the PMEM cache 121. The operation of receiving an RDMA access request is illustrated using DB process 105-1a on computing node 102-1 and storage process 125 on storage cell 120. DB process 105-1a may initiate a read or write operation to either read a data block or write to a data block. Storage service 112-1 may receive a request from DB process 105-1a and determine the home location for the data block by searching a data dictionary stored in main memory 104-1. Once the home location for the data block is determined, storage service 112-1 may determine whether the requested data may be accessed using an RDMA access request. As described, in order for an RDMA access request to be issued, an active RDMA connection needs to be available between computing node 102-1 and storage cell 120, and a copy of the requested data needs to be loaded in the PMEM cache 121. The storage service 112-1 may check whether an active RDMA connection exists by checking the table configuration information stored in main memory 104-1. If an active RDMA connection exists, then the storage service 112-1 may scan a local copy of the portion of the RDMA table 122, stored in the main memory 104-1, to determine whether there is a cache hit for the requested data. If there is a cache hit, then the requested data has been loaded in the PMEM cache 121, and the storage service 112-1 may generate an RDMA access request for the requested data. As described, the RDMA access request generated by computing node 102-1 contains at least a remote key that corresponds to the cache line that contains the requested data and the protection domain identifier. The protection domain identifier is derived based upon the database cluster identifier and database ID.
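
The client-side decision just described, whether to issue an RDMA access request or fall back to a message-based read, might look like the following sketch. The derivation of the protection domain identifier from the cluster identifier and database ID is shown as simple string concatenation purely for illustration; the helper names are hypothetical.

    from dataclasses import dataclass
    from typing import Dict, Optional

    def derive_pd(cluster_id: str, database_id: str) -> str:
        # Illustrative derivation only: cluster identifier joined with database ID.
        return f"{cluster_id}.{database_id}"          # e.g. "C1.DB1"

    @dataclass
    class LocalTableRecord:
        cache_line: int
        remote_key: int

    @dataclass
    class RdmaAccessRequest:
        remote_key: int
        protection_domain: str
        home_location: str

    def build_rdma_request(active_connection: bool,
                           local_table: Dict[str, LocalTableRecord],
                           home_location: str,
                           cluster_id: str,
                           database_id: str) -> Optional[RdmaAccessRequest]:
        if not active_connection:
            return None                               # fall back to a message-based read
        record = local_table.get(home_location)       # scan local copy of the RDMA table
        if record is None:
            return None                               # cache miss: use a message-based read
        return RdmaAccessRequest(remote_key=record.remote_key,
                                 protection_domain=derive_pd(cluster_id, database_id),
                                 home_location=home_location)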

Operations described in FIG. 4 are described in the context of a read request; however, the operations may also apply to a request to write data to a data block. At block 402, the storage cell receives an RDMA access request from a client computing node. In an embodiment, the network adapter 124 of the storage cell 120 receives the RDMA access request. The RDMA access request contains a protection domain identifier for the requesting client, the corresponding remote key for the cache line that contains the requested data, and a home location address for the requested data. As described, the network adapter 124 is an RNIC which supports an RDMA protocol, such that the network adapter 124 is configured to receive RDMA access requests from a computing node, determine whether the requestor has appropriate permissions to access the requested data, and access copies of data stored in the PMEM cache 121.

At decision diamond 404, it is determined whether the client has permissions to perform an RDMA access of the cache line that contains the copy of the requested data based on access credentials and the remote key. Access credentials may represent the protection domain identifier for the requesting client. In an embodiment, the network adapter 124 queries the RDMA table 122 to determine whether the protection domain identifier associated with the cache line matches the protection domain identifier in the RDMA access request for the remote key specified in the access request. If the protection domain identifier associated with the cache line, identified using the remote key, does not match the protection domain identifier in the RDMA access request, then the client does not have permission to access the copy of the data in the cache line. The process would then proceed to block 410. At block 410, the network adapter 124 terminates the RDMA access request and terminates the RDMA connection between the computing node 102-1 and the storage cell 120. Termination of the RDMA connection may occur to ensure that clients using invalid credentials have their RDMA connection information and table configuration information reset. Scenarios in which clients have invalid credentials may include, but are not limited to, the PMEM cache 121 reusing existing cache lines and clients intentionally trying to access unauthorized data. In one example, the cache manager 126 may reassign cache lines to other clients that need additional cache space within the PMEM cache 121. If this occurs, then the storage process 125 may update the records in the RDMA table 122 to reflect the reassignment of cache lines. Updated copies of the portions of the RDMA table 122 may then be sent to clients. However, if an RDMA access request is generated by a client prior to receiving the updated copy of the portion of the RDMA table 122, then the RDMA access request may specify an invalid protection domain identifier and/or remote key. As a result, the network adapter 124, upon determining that the client does not have the correct permissions for the requested cache line, may terminate the RDMA connection in order to trigger updated RDMA connection information and table configuration information to be sent to the client.
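
A simplified model of the check at decision diamond 404 and the termination at block 410 is sketched below, assuming the network adapter can resolve a remote key to the protection domain identifier and cache line recorded in the RDMA table 122; the class and method names are hypothetical.

    from typing import Dict, Optional, Tuple

    class NetworkAdapter:
        def __init__(self,
                     rkey_index: Dict[int, Tuple[str, int]],
                     pmem_cache: Dict[int, bytes]) -> None:
            # remote key -> (protection domain of the cache line, cache line index)
            self.rkey_index = rkey_index
            self.pmem_cache = pmem_cache
            self.connection_open = True

        def handle_rdma_read(self, remote_key: int, request_pd: str) -> Optional[bytes]:
            entry = self.rkey_index.get(remote_key)
            if entry is None or entry[0] != request_pd:
                # Block 410: reject the request and tear down the connection so the
                # client is forced to refresh its connection and table configuration.
                self.connection_open = False
                return None
            _, cache_line = entry
            return self.pmem_cache.get(cache_line)   # block 406: direct access of the line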

In another example, if the client is purposely trying to access an unauthorized cache line, then, as a security measure, the network adapter 124 may terminate the RDMA connection. Referring to FIG. 2A, if a client associated with protection domain C1.DB3, which is authorized to access the cache line referenced by record 202-N, attempts to access the cache line represented by record 202-1, which is assigned to another client (protection domain C1.DB1), then this RDMA access request would be identified as an unauthorized request and the RDMA connection between the unauthorized client and the storage cell 120 would be terminated at block 410.

If, however, at decision diamond 404, the network adapter 124 determines that the cache line protection domain identifier matches the protection domain identifier in the RDMA access request, then the client has permission to access the copy of the data in the cache line. Since protection domain identifiers are based upon a cluster identifier and a database identifier, different instances of a database instantiated on different computing nodes would have access to the same set of cache lines within the PMEM cache 121. For example, instance 103-2a running DB process 105-2a on computing node 102-2, which represents database 1 on cluster 1, would have the same protection domain identifier as instance 103-1a running DB process 105-1a on computing node 102-1. As a result, RDMA access requests originating from requests from either instance 103-1a or 103-2a would have authorization to access the cache line corresponding to record 202-1, which has a protection domain identifier of C1.DB1.
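
Continuing the illustrative derivation used earlier, the short example below shows why two instances of the same database on the same cluster obtain identical protection domain identifiers and therefore identical RDMA access rights; the derivation remains an assumption for illustration only.

    def derive_pd(cluster_id: str, database_id: str) -> str:
        # Illustrative derivation only: cluster identifier joined with database ID.
        return f"{cluster_id}.{database_id}"

    # Instance 103-1a and instance 103-2a both represent database 1 on cluster 1,
    # so both derive the same protection domain identifier.
    assert derive_pd("C1", "DB1") == derive_pd("C1", "DB1") == "C1.DB1"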

At block 406, the network adapter 124 processes the RDMA access request by performing a direct memory access of the target cache line in the PMEM cache 121. Specifically, the network adapter 124 retrieves a copy of the target cache line from the PMEM cache 121.

At block 408, an RDMA response message that includes the copy of the requested data is sent to the client. In an embodiment, the network adapter 124 generates a response message that includes the copy of the requested data and sends the response message back to the computing node 102-1 via the RDMA connection queues.

The response message is then received by the network adapter 109-1 on computing node 102-1. The network adapter 109-1 then loads the copy of the requested data into the database buffer pool 108-1. The DB process 105-1a may then access the requested data.

Hardware Overview

According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.

For example, FIG. 5 is a block diagram that illustrates a computer system 500 upon which an embodiment of the invention may be implemented. Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a hardware processor 504 coupled with bus 502 for processing information. Hardware processor 504 may be, for example, a general purpose microprocessor.

Computer system 500 also includes a main memory 506, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for storing information and instructions to be executed by processor 504. Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504. Such instructions, when stored in non-transitory storage media accessible to processor 504, render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions.

Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504. A storage device 510, such as a magnetic disk, optical disk, or solid-state drive, is provided and coupled to bus 502 for storing information and instructions.

Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

Computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 510. Volatile media includes dynamic memory, such as main memory 506. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.

Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 502. Bus 502 carries the data to main memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by main memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.

Computer system 500 also includes a communication interface 518 coupled to bus 502. Communication interface 518 provides a two-way data communication coupling to a network link 520 that is connected to a local network 522. For example, communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 520 typically provides data communication through one or more networks to other data devices. For example, network link 520 may provide a connection through local network 522 to a host computer 524 or to data equipment operated by an Internet Service Provider (ISP) 526. ISP 526 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 528. Local network 522 and Internet 528 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 520 and through communication interface 518, which carry the digital data to and from computer system 500, are example forms of transmission media.

Computer system 500 can send messages and receive data, including program code, through the network(s), network link 520 and communication interface 518. In the Internet example, a server 530 might transmit a requested code for an application program through Internet 528, ISP 526, local network 522 and communication interface 518.

The received code may be executed by processor 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

Software Overview

FIG. 6 is a block diagram of a basic software system 600 that may be employed for controlling the operation of computer system 500. Software system 600 and its components, including their connections, relationships, and functions, is meant to be exemplary only, and not meant to limit implementations of the example embodiment(s). Other software systems suitable for implementing the example embodiment(s) may have different components, including components with different connections, relationships, and functions.

Software system 600 is provided for directing the operation of computer system 500. Software system 600, which may be stored in system memory (RAM) 506 and on fixed storage (e.g., hard disk or flash memory) 510, includes a kernel or operating system (OS) 610.

The OS 610 manages low-level aspects of computer operation, including managing execution of processes, memory allocation, file input and output (I/O), and device I/O. One or more application programs, represented as 602A, 602B, 602C . . . 602N, may be “loaded” (e.g., transferred from fixed storage 510 into memory 506) for execution by the system 600. The applications or other software intended for use on computer system 500 may also be stored as a set of downloadable computer-executable instructions, for example, for downloading and installation from an Internet location (e.g., a Web server, an app store, or other online service).

Software system 600 includes a graphical user interface (GUI) 615, for receiving user commands and data in a graphical (e.g., “point-and-click” or “touch gesture”) fashion. These inputs, in turn, may be acted upon by the system 600 in accordance with instructions from operating system 610 and/or application(s) 602. The GUI 615 also serves to display the results of operation from the OS 610 and application(s) 602, whereupon the user may supply additional inputs or terminate the session (e.g., log off).

OS 610 can execute directly on the bare hardware 620 (e.g., processor(s) 504) of computer system 500. Alternatively, a hypervisor or virtual machine monitor (VMM) 630 may be interposed between the bare hardware 620 and the OS 610. In this configuration, VMM 630 acts as a software “cushion” or virtualization layer between the OS 610 and the bare hardware 620 of the computer system 500.

VMM 630 instantiates and runs one or more virtual machine instances (“guest machines”). Each guest machine comprises a “guest” operating system, such as OS 610, and one or more applications, such as application(s) 602, designed to execute on the guest operating system. The VMM 630 presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems.

In some instances, the VMM 630 may allow a guest operating system to run as if it is running on the bare hardware 620 of computer system 500 directly. In these instances, the same version of the guest operating system configured to execute on the bare hardware 620 directly may also execute on VMM 630 without modification or reconfiguration. In other words, VMM 630 may provide full hardware and CPU virtualization to a guest operating system in some instances.

In other instances, a guest operating system may be specially designed or configured to execute on VMM 630 for efficiency. In these instances, the guest operating system is “aware” that it executes on a virtual machine monitor. In other words, VMM 630 may provide para-virtualization to a guest operating system in some instances.

A computer system process comprises an allotment of hardware processor time, and an allotment of memory (physical and/or virtual), the allotment of memory being for storing instructions executed by the hardware processor, for storing data generated by the hardware processor executing the instructions, and/or for storing the hardware processor state (e.g. content of registers) between allotments of the hardware processor time when the computer system process is not running. Computer system processes run under the control of an operating system, and may run under the control of other programs being executed on the computer system.

Cloud Computing

The term “cloud computing” is generally used herein to describe a computing model which enables on-demand access to a shared pool of computing resources, such as computer networks, servers, software applications, and services, and which allows for rapid provisioning and release of resources with minimal management effort or service provider interaction.

A cloud computing environment (sometimes referred to as a cloud environment, or a cloud) can be implemented in a variety of different ways to best suit different requirements. For example, in a public cloud environment, the underlying computing infrastructure is owned by an organization that makes its cloud services available to other organizations or to the general public. In contrast, a private cloud environment is generally intended solely for use by, or within, a single organization. A community cloud is intended to be shared by several organizations within a community; while a hybrid cloud comprises two or more types of cloud (e.g., private, community, or public) that are bound together by data and application portability.

Generally, a cloud computing model enables some of those responsibilities which previously may have been provided by an organization's own information technology department, to instead be delivered as service layers within a cloud environment, for use by consumers (either within or external to the organization, according to the cloud's public/private nature). Depending on the particular implementation, the precise definition of components or features provided by or within each cloud service layer can vary, but common examples include: Software as a Service (SaaS), in which consumers use software applications that are running upon a cloud infrastructure, while a SaaS provider manages or controls the underlying cloud infrastructure and applications. Platform as a Service (PaaS), in which consumers can use software programming languages and development tools supported by a PaaS provider to develop, deploy, and otherwise control their own applications, while the PaaS provider manages or controls other aspects of the cloud environment (i.e., everything below the run-time execution environment). Infrastructure as a Service (IaaS), in which consumers can deploy and run arbitrary software applications, and/or provision processing, storage, networks, and other fundamental computing resources, while an IaaS provider manages or controls the underlying physical cloud infrastructure (i.e., everything below the operating system layer). Database as a Service (DBaaS), in which consumers use a database server or Database Management System that is running upon a cloud infrastructure, while a DBaaS provider manages or controls the underlying cloud infrastructure, applications, and servers, including one or more database servers.

EXTENSIONS AND ALTERNATIVES

Although some of the figures described in the foregoing specification include flow diagrams with steps that are shown in an order, the steps may be performed in any order, and are not limited to the order shown in those flowcharts. Additionally, some steps may be optional, may be performed multiple times, and/or may be performed by different components. All steps, operations and functions of a flow diagram that are described herein are intended to indicate operations that are performed using programming in a special-purpose computer or general-purpose computer, in various embodiments. In other words, each flow diagram in this disclosure, in combination with the related text herein, is a guide, plan or specification of all or part of an algorithm for programming a computer to execute the functions that are described. The level of skill in the field associated with this disclosure is known to be high, and therefore the flow diagrams and related text in this disclosure have been prepared to convey information at a level of sufficiency and detail that is normally expected in the field when skilled persons communicate among themselves with respect to programs, algorithms and their implementation. In the foregoing specification, the example embodiment(s) of the present invention have been described with reference to numerous specific details. However, the details may vary from implementation to implementation according to the requirements of the particular implementation at hand. The example embodiment(s) are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

What is claimed is:
1. A method comprising: a storage server maintaining a persistent memory (PMEM) cache comprising a plurality of cache lines, each cache line of said plurality of cache lines caching an allocation unit of a block-based storage managed by said storage server; the storage server maintaining a Remote Direct Memory Access (RDMA) table that includes a plurality of table entries, each table entry of said plurality of table entries: mapping a respective storage server client of a plurality of storage server clients to one or more cache lines of said plurality of cache lines that cache an allocation unit for said respective storage server client, and including a respective remote access key value for accessing said one or more cache lines using RDMA access; receiving, from a particular storage server client of said plurality of storage server clients, an RDMA request to access a particular cache line in said PMEM cache, said request including a particular remote access key value from a particular table entry of said plurality of table entries for said particular cache line; identifying, by the storage server, access credentials associated with the particular storage server client; determining whether the particular storage server client has permission to perform an RDMA access to the particular cache line based on the access credentials and the particular remote access key value; and upon determining that the particular storage server client has permission to perform the RDMA access, accessing and sending the particular cache line from said PMEM cache to the particular storage server client.
2. The method of claim 1, further comprising, upon determining that the particular storage server client does not have permission to perform the RDMA access, terminating a connection between the particular storage server client and the storage server.
3. The method of claim 1, wherein the RDMA request to access the particular cache line in said PMEM cache from the particular storage server client is received via an RDMA connection between a first host channel adapter on the particular storage server client and a second host channel adapter on the storage server.
4. The method of claim 1, wherein the RDMA table is stored within the PMEM cache.
5. The method of claim 1, wherein the RDMA request to access the particular cache line originated from a database process for a particular database cluster and particular database instantiated on the particular storage server client; and wherein the access credentials are based on the particular database cluster and the particular database.
6. The method of claim 1, wherein determining whether the particular storage server client has permission to perform the RDMA access to the particular cache line, comprises: accessing, in the RDMA table, a record associated with the particular cache line using the remote access key value; determining whether access credentials associated with the record match the access credentials associated with the particular storage server client.
7. The method of claim 6, wherein determining whether the particular storage server client has permission to perform the RDMA access to the particular cache line is performed by a host channel adapter on the storage server.
8. The method of claim 1, further comprising: prior to receiving the RDMA request to access the particular cache line, receiving, from the particular storage server client, a request to access one or more data blocks stored on the particular storage server; identifying, by the storage server, the access credentials associated with the particular storage server client; determining whether a copy of the one or more data blocks is stored in the PMEM cache; upon determining that the copy of the one or more data blocks is not stored in the PMEM cache: allocating, by the storage server, one or more cache lines in the PMEM cache for storing the copy of the one or more data blocks; associating, by the storage server, the one or more cache lines to the particular storage server client; loading, by the storage server, the copy of the one or more data blocks into the one or more cache lines; sending the copy of the one or more data blocks to the particular storage server client.
9. The method of claim 1, further comprising: prior to receiving the RDMA request to access the particular cache line: receiving, from the particular storage server client, a request to access one or more data blocks stored on the particular storage server; determining whether the particular storage server client is associated with the access credentials; upon determining that the particular storage server client is not associated with the access credentials, generating, by the storage server, the access credentials for the particular storage server client based on a particular database cluster and a particular database associated with the particular storage server client; allocating, by the storage server, one or more cache lines in the PMEM cache for storing a copy of the one or more data blocks; associating, by the storage server, the one or more cache lines to the particular storage server client; loading, by the storage server, the copy of the one or more data blocks into the one or more cache lines; sending the copy of the one or more data blocks to the particular storage server client.
10. The method of claim 9, wherein generating the access credentials for the particular storage server client comprises: generating an access credential value based on the particular database cluster and the particular database associated with the particular storage server client; storing the access credential value in a data structure loaded in volatile memory and associated with the RDMA table.
11. One or more non-transitory storage media storing instructions which, when executed by one or more computing devices, cause operations comprising: maintaining a persistent memory (PMEM) cache comprising a plurality of cache lines, each cache line of said plurality of cache lines caching an allocation unit of a block-based storage managed by said storage server; maintaining a Remote Direct Memory Access (RDMA) table that includes a plurality of table entries, each table entry of said plurality of table entries: mapping a respective storage server client of a plurality of storage server clients to one or more cache lines of said plurality of cache lines that caches an allocation unit for said respective storage server client, and including a respective remote access key value for accessing said one or more cache lines using RDMA access; receiving, from a particular storage server client of said plurality of storage server clients, an RDMA request to access a particular cache line in said PMEM cache, said request including a particular remote access key value from a particular table entry of said plurality of table entries for said particular cache line; identifying access credentials associated with the particular storage server client; determining whether the particular storage server client has permission to perform an RDMA access to the particular cache line based on the access credentials and the particular remote access key value; and upon determining that the particular storage server client has permission to perform the RDMA access, accessing and sending the particular cache line from said PMEM cache to the particular storage server client.
12. The non-transitory computer-readable media of claim 11, the operations further comprising, upon determining that the particular storage server client does not have permission to perform the RDMA access, terminating a connection between the particular storage server client and the storage server.
13. The non-transitory computer-readable media of claim 11, wherein the RDMA request to access the particular cache line in said PMEM cache from the particular storage server client is received via an RDMA connection between a first host channel adapter on the particular storage server client and a second host channel adapter on the storage server.
14. The non-transitory computer-readable media of claim 11, wherein the RDMA table is stored within the PMEM cache.
15. The non-transitory computer-readable media of claim 11, wherein the RDMA request to access the particular cache line originated from a database process for a particular database cluster and particular database instantiated on the particular storage server client; and wherein the access credentials are based on the particular database cluster and the particular database.
16. The non-transitory computer-readable media of claim 11, wherein determining whether the particular storage server client has permission to perform the RDMA access to the particular cache line, comprises: accessing, in the RDMA table, a record associated with the particular cache line using the remote access key value; determining whether access credentials associated with the record match the access credentials associated with the particular storage server client.
17. The non-transitory computer-readable media of claim 16, wherein determining whether the particular storage server client has permission to perform the RDMA access to the particular cache line is performed by a host channel adapter on the storage server.
18. The non-transitory computer-readable media of claim 11, the operations further comprising: prior to receiving the RDMA request to access the particular cache line, receiving, from the particular storage server client, a request to access one or more data blocks stored on the particular storage server; identifying, by the storage server, the access credentials associated with the particular storage server client; determining whether a copy of the one or more data blocks is stored in the PMEM cache; upon determining that the copy of the one or more data blocks is not stored in the PMEM cache: allocating, by the storage server, one or more cache lines in the PMEM cache for storing the copy of the one or more data blocks; associating, by the storage server, the one or more cache lines to the particular storage server client; loading, by the storage server, the copy of the one or more data blocks into the one or more cache lines; sending the copy of the one or more data blocks to the particular storage server client.
19. The non-transitory computer-readable media of claim 11, the operations further comprising: prior to receiving the RDMA request to access the particular cache line, receiving, from the particular storage server client, a request to access one or more data blocks stored on the particular storage server; determining whether the particular storage server client is associated with the access credentials; upon determining that the particular storage server client is not associated with the access credentials, generating, by the storage server, the access credentials for the particular storage server client based on a particular database cluster and a particular database associated with the particular storage server client; allocating, by the storage server, one or more cache lines in the PMEM cache for storing a copy of the one or more data blocks; associating, by the storage server, the one or more cache lines to the particular storage server client; loading, by the storage server, the copy of the one or more data blocks into the one or more cache lines; sending the copy of the one or more data blocks to the particular storage server client.
20. The non-transitory computer-readable media of claim 19, wherein generating the access credentials for the particular storage server client comprises: generating an access credential value based on the particular database cluster and the particular database associated with the particular storage server client; storing the access credential value in a data structure loaded in volatile memory and associated with the RDMA table.